potter-potter (Contributor) commented Nov 14, 2025

Note

Adds a Teradata SQL source/destination connector with registration, deps, and unit tests; bumps version; skips Jira integration test; updates e2e fixtures.

  • Connectors (SQL):
    • Teradata: New teradata source/destination in processes/connectors/sql/teradata.py with ConnectionConfig, Downloader, Uploader, and UploadStager (identifier quoting, TOP syntax, qmark params, JSON conversion for lists/dicts).
    • Registered entries in processes/connectors/sql/__init__.py.
  • Dependencies:
    • Added optional dep group teradata in pyproject.toml and requirements/connectors/teradata.txt.
  • Tests:
    • Added unit tests in test/unit/connectors/sql/test_teradata.py.
    • Skipped Jira integration test (test/integration/connectors/test_jira.py).
    • Updated E2E expected JSON fixtures under test_e2e/expected-structured-output/s3/.
  • Versioning/Docs:
    • Bump to 1.2.22 and update CHANGELOG.md.
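
For readers unfamiliar with the Teradata-specific behaviors listed in the note, here is a minimal sketch of what identifier quoting, `TOP` syntax, qmark parameters, and JSON conversion typically look like. The function names below are hypothetical and are not taken from `teradata.py`; the connector's actual implementation may differ.

```python
import json
from typing import Any, List, Optional


def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_select(table: str, columns: List[str], limit: Optional[int] = None) -> str:
    """Build a SELECT that limits rows with Teradata's TOP n instead of LIMIT."""
    cols = ", ".join(quote_identifier(c) for c in columns)
    top = f"TOP {limit} " if limit is not None else ""
    return f"SELECT {top}{cols} FROM {quote_identifier(table)}"


def build_insert(table: str, columns: List[str]) -> str:
    """Build an INSERT that uses qmark ("?") placeholders for every column."""
    cols = ", ".join(quote_identifier(c) for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    return f"INSERT INTO {quote_identifier(table)} ({cols}) VALUES ({placeholders})"


def to_db_value(value: Any) -> Any:
    """Serialize lists and dicts to JSON strings; pass other values through."""
    if isinstance(value, (list, dict)):
        return json.dumps(value)
    return value
```

Quoting every identifier avoids collisions with reserved words such as `type`, and serializing nested values keeps each row flat enough to pass to `executemany`.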

Written by Cursor Bugbot for commit 127ee65.

awalker4 (Contributor) commented Nov 18, 2025

I hit an encoding error when trying to add the winter-sports.epub sample doc. It looks like we can add charset: "UTF8" to the conn_params, but I'm not sure how we usually address this with other connectors.

From Cursor:

I checked our SQL connectors; none explicitly set client encoding. Postgres “just works” because the stack is Unicode by default.
Postgres: connection has no charset setting; psycopg negotiates client_encoding to the DB’s encoding (commonly UTF8), and text/varchar store full Unicode.
postgres.py, lines 58-66:
    connection = connect(
        user=self.username,
        password=access_config.password,
        dbname=self.database,
        host=self.host,
        port=self.port,
    )
SQLite: Python’s sqlite3 passes Python str as Unicode; SQLite stores TEXT as UTF‑8/UTF‑16.
Snowflake: NVARCHAR/VARIANT are Unicode; connector handles UTF‑8 by default.
SingleStore: modern drivers default to utf8mb4 unless overridden.
Teradata is the outlier: if the session is LATIN or columns are CHAR/VARCHAR … CHARACTER SET LATIN, Unicode characters (curly quotes/emoji) trigger 6706. So we need to (a) set the Teradata session charset to UTF‑8 and (b) ensure target columns are UNICODE.

2025-11-18 16:14:48,937 MainProcess ERROR    Exception raised while running upload
Traceback (most recent call last):
  File "/Users/austin/repos/unstructured-ingest/unstructured_ingest/pipeline/interfaces.py", line 171, in run_async
    return await self._run_async(fn=fn, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/austin/repos/unstructured-ingest/unstructured_ingest/pipeline/steps/upload.py", line 53, in _run_async
    fn(**fn_kwargs)
  File "/Users/austin/repos/unstructured-ingest/unstructured_ingest/processes/connectors/sql/sql.py", line 456, in run
    self.upload_dataframe(df=df, file_data=file_data)
  File "/Users/austin/repos/unstructured-ingest/unstructured_ingest/processes/connectors/sql/teradata.py", line 226, in upload_dataframe
    cursor.executemany(stmt, values)
  File "/Users/austin/repos/unstructured-ingest/.venv/lib/python3.12/site-packages/teradatasql/__init__.py", line 1054, in executemany
    raise OperationalError (sErr)
teradatasql.OperationalError: [Version 20.0.0.47] [Session 1269] [Teradata Database] [Error 6706] The string contains an untranslatable character.
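
To narrow down which values trip error 6706, a rough pre-check like the one below could help. It is only a sketch: the helper name is hypothetical, and Python's `latin-1` codec is used as an approximation of Teradata's LATIN server character set, which is not guaranteed to match it exactly.

```python
from typing import List, Tuple


def find_non_latin_chars(text: str) -> List[Tuple[int, str]]:
    """Return (index, character) pairs that cannot be encoded as Latin-1.

    Assumption: Latin-1 is close enough to Teradata's LATIN character set
    to flag likely offenders such as curly quotes and emoji.
    """
    offenders = []
    for index, char in enumerate(text):
        try:
            char.encode("latin-1")
        except UnicodeEncodeError:
            offenders.append((index, char))
    return offenders


# A curly apostrophe (U+2019) is flagged; plain ASCII text is not.
print(find_non_latin_chars("it\u2019s snowing"))
print(find_non_latin_chars("plain text"))
```

Running this over the string columns of the dataframe before `executemany` would show whether the failure comes from the document text or from some other field.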

potter-potter (Contributor, Author) commented Nov 18, 2025

oh. good find. Teradata is a real stickler for this type of stuff. Seems like something to warn on. I believe the UNICODE would need to be set in the table on creation. So that is a user side setup that we would call out in the Documentation.
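
As a sketch of what that user-side setup might look like, the helper below builds a `CREATE TABLE` statement whose text columns are declared `CHARACTER SET UNICODE`. The helper, table name, and column names are hypothetical; the real destination schema would need to match whatever the uploader writes.

```python
from typing import Dict


def build_unicode_table_ddl(table: str, text_columns: Dict[str, int]) -> str:
    """Build Teradata DDL declaring each text column as CHARACTER SET UNICODE.

    text_columns maps a column name to its VARCHAR length in characters.
    """
    column_defs = ",\n    ".join(
        f'"{name}" VARCHAR({length}) CHARACTER SET UNICODE'
        for name, length in text_columns.items()
    )
    return f'CREATE TABLE "{table}" (\n    {column_defs}\n)'


# Hypothetical table with an id column and a large text column.
print(build_unicode_table_ddl("elements", {"id": 64, "text": 32000}))
```

Columns created this way can store curly quotes and emoji, whereas a `CHARACTER SET LATIN` column would still raise error 6706 on insert.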
